Data Mining And Data Warehousing
K-means Clustering Algorithm
K-Means is a popular unsupervised learning algorithm used for clustering. It partitions a dataset into k distinct, non-overlapping groups (clusters) based on similarity, where k is the number of clusters requested.
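One common way to make "similarity" precise, assuming squared Euclidean distance is the measure, is the within-cluster sum of squares. The symbols here are notation introduced for illustration: x_i is a data point, C_j is the set of points in cluster j, and mu_j is the mean (centroid) of that cluster. K-means searches for the partition that makes this quantity small:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A lower J means points sit closer to their own centroid. The algorithm is only guaranteed to reach a local minimum of J, not necessarily the best possible one.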
Working of K-means
Consider a dataset with n data points and a desired number of clusters k:
- Initialize: choose k cluster centroids randomly.
- Assignment step: assign each data point to the nearest centroid (based on Euclidean distance).
- Update step: recalculate each centroid as the mean of all points assigned to its cluster.
- Repeat: the assignment and update steps are repeated until either the centroids no longer move significantly (convergence) or a maximum number of iterations is reached.
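The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation. The function name `kmeans` and the parameters `max_iters`, `tol`, and `seed` are assumptions made for this example. So are two choices the text leaves open: the initial centroids are drawn from the data points themselves, and an empty cluster keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i] in the last assignment step.
    """
    rng = random.Random(seed)
    # Initialize: pick k distinct data points as starting centroids
    # (an assumption; other random initializations are possible).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    for _ in range(max_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Empty cluster: keep the old centroid (assumed policy).
                new_centroids.append(centroids[j])

        # Convergence check: largest distance any centroid moved.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break

    return centroids, labels


# Toy data made up for illustration: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initialization.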
Advantages
- Fast and efficient for large datasets: the cost of each iteration grows linearly with the number of data points n.
- Easy to implement and interpret.
- Works well with spherical, well-separated clusters.
Limitations
- Must specify k beforehand.
- Sensitive to outliers and initial centroids.
- Assumes clusters are isotropic (uniform in all directions) and equally sized.
- Poor performance on non-convex clusters or clusters of different densities.
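The sensitivity to outliers listed above follows from the update step, which uses the mean. A small sketch with made-up one-dimensional values shows how a single extreme point drags a centroid away from the bulk of its cluster:

```python
from statistics import mean

# Hypothetical 1-D cluster; the values are invented for illustration.
cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]

centroid_clean = mean(cluster)         # 2.0
centroid_skewed = mean(with_outlier)   # 26.5

print(centroid_clean, centroid_skewed)
```

With the outlier included, the centroid lands far from every original member of the cluster, which can in turn change which points are assigned to it in the next iteration.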